I decided to conduct my Week 9 research project on exploring COVID statistics and its correlation with new indicators of government response or measures taken against COVID virus spreading.
is there a significant correlation between cumulative cases and deaths from COVID-19?
is there a significant correlation between government response index and government health expenditure in previous years?
is there strong positive correlation between confirmed COVID-19 cases and government response index over time?
yes, there is significant positive correlation between deaths from COVID-19 and maximum government response;
no, strict response to COVID-19 in 2020 and better funding healthcare systems in previuos years are not significantly correlated.
yes, correlation between confirmed cases and government response over time is strong.
To explore COVID statistics I use COVID-19 Data Repository by the Center for Systems Science and Engineering at Johns Hopkins University, or JHU CSSE COVID-19 Dataset (here - Dataset 1) accessing it via API.
To explore new indicators of government response (measures taken against COVID virus spreading) I use Oxford Covid-19 Government Response Tracker (here - Dataset 2).
Part I of this project presented as chapter "Dataset 1: Johns Hopkins University & Medicine (JHU)" with paragraphs 1 to 4, where I add some improvements to it using Dataset 3 in order to have new structures in the dataset and new insights in research.
Part II is presented as "Dataset 2: Oxford Covid-19 Government Response Tracker" and paragraph 5, and it shows the aggregated dataset and correlation analysis.
Timeseries from January 22, 2020 to August 20, 2020 are available for downloading: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series
Raw data links to cumulative data:
Confirmed cases: https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
import warnings
warnings.filterwarnings('ignore')
The GIS technologies have played an important role in many aspects, including the data integration, and geospatial visualization of epidemic information, spatial tracking of confirmed cases, prediction of regional transmission, and many more. These provide support information for government sectors to fight against the COVID-19 spreading.
The Center for Systems Science and Engineering (CSSE) at Johns Hopkins University & Medicine (JHU) had provided the dashboard created with ESRI ArcGIS operation dashboard (https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6). But feature for visualizing the change of data overtime on the map is missing. Later JHU created animated map on confirmed cases here https://coronavirus.jhu.edu/data/animated-world-map , separately from the dashboard, but the users can only observe the map changing colors, they have no access to view the actual numbers or zoom in the map, as it is not interactive and does not show the actual data.
So I decided to create animated maps to explore data changes over time. In order to do that my current dataset structure should be changed. Now the data structure is that every day's statistics is a separate column, so the values are "scattered" in unique cells for each day and country; I will move all the values to a single "value" column, and move all days labels from columns names to single "Date" column. It will transform the dataset to its long variation with repeating country rows and date rows.
import pandas as pd #to work with tabular data
import pycountry #to get the three-letter country codes ISO 3166–1 for each country
df_cases=pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")
# Aggregate the dataset
df_cases = df_cases.drop(columns=['Province/State','Lat','Long'])
df_cases = df_cases.groupby('Country/Region').agg('sum')
date_list = list(df_cases.columns)
# Get the country codes for each country
#list(pycountry.countries) #uncomment to load the list of available data
def get_country_code(name):
"""
Return ISO-3 letter code for country by its name;
Return None if name is not found in the pycountry.countries
"""
try:
return pycountry.countries.lookup(name).alpha_3
except:
return None
df_cases['Country'] = df_cases.index
Several countries' names are written differently than pycountry expects, so I change their names to match and get the code
df_cases.loc[df_cases.Country=="Burma",'Country']='Myanmar'
df_cases.loc[df_cases.Country=="Brunei",'Country']='Brunei Darussalam'
df_cases.loc[df_cases.Country=="Iran",'Country']='Iran, Islamic Republic of'
df_cases.loc[df_cases.Country=="Congo (Brazzaville)",'Country']='Congo, The Democratic Republic of the'
df_cases.loc[df_cases.Country=="Congo (Kinshasa)",'Country']='Republic of the Congo'
df_cases.loc[df_cases.Country=="Cote d'Ivoire",'Country']="Côte d'Ivoire"
df_cases.loc[df_cases.Country=="Korea, South",'Country']="Korea, Republic of"
df_cases.loc[df_cases.Country=="Syria",'Country']="Syrian Arab Republic"
df_cases.loc[df_cases.Country=="Taiwan*",'Country']="Taiwan, Province of China"
df_cases.loc[df_cases.Country=="Russia",'Country']='Russian Federation'
df_cases.loc[df_cases.Country=="West Bank and Gaza",'Country']='Palestine, State of'
df_cases.loc[df_cases.Country=="Venezuela",'Country']='Venezuela, Bolivarian Republic of'
df_cases.loc[df_cases.Country=="US",'Country']='United States'
df_cases.loc[df_cases.Country=="Laos",'Country']="Lao People's Democratic Republic"
As soon as names are unified, I can add their ISO-3 codes.
print(len(df_cases['Country'].tolist()))
df_cases['ISO-3'] = df_cases['Country'].apply(get_country_code)
# Transform the dataset in a long format
df_cases = pd.melt(df_cases, id_vars=['Country','ISO-3'], value_vars=date_list)
# Rename columns
df_cases = df_cases.rename(columns={"variable": "Date","value": "Value"})
#df_cases[0:60] #checking the slice 1 in the dataset on confirmed cases, uncomment to load
There is one "None" value left in the df_confirmed_long[0:60] slice ("Diamond Princess").
#df_cases[60:120] #checking the slice 2 in the dataset on confirmed cases, uncomment to load
#df_cases[120:160] #checking the slice 3 in the dataset on confirmed cases, uncomment to load
#df_cases[160:188] #checking the slice 4 in the dataset on confirmed cases, uncomment to load
There are "None" values for "MS Zaandam", "Holy See" (Vatican) and "Kosovo" left in the df_confirmed_long[60:120] slice. First is not a country, second is too small and excessive to dataset (population is 809 people), but Kosovo is important to show on the map as this European country has population more than 1.8 mln people and 11 thousands of confirmed cases.
The problem is, that "Kosovo" is not listed in pycountry dictionary (although the World Bank added XKX code to Kosovo in June 2017 according to archives https://libraries.acm.org/binaries/content/assets/libraries/archive/world-bank-list-of-economies.pdf), that is why I need to "fix Kosovo" after adding all other codes with apply(get_country_code).
#add ISO-3 code manually as it is not listed in pycountry dictionary
df_cases.loc[df_cases.Country=="Kosovo",'ISO-3']="XKX"
#check ISO-3 for "Kosovo" to make sure the code is applied
df_cases[df_cases['ISO-3']=='XKX'][:5]
Now it is safe to drop "None" values.
df_cases = df_cases.dropna()
df_cases[47:49] #there is no Diamond Princess in the dataset anymore, its index was 48
df_cases[101:104] #there is no MS Zaandam in the dataset anymore, its index was 104
Dataset is cleaned and has data on 188 countries since January 2020 till January 2021.
print(len(df_cases['ISO-3'].unique().tolist()))
print(len(df_cases['ISO-3']))
print((len(df_cases['ISO-3']))==(len(df_cases['Date']))==(len(df_cases['Value'])))
Checking for null
df_cases.isnull().any() # check for NaN
Now I can use Plotly Express to create animated map. The cumulative cases animation shows the total number of cases reported in each country at each point in time, regardless of how many people have recovered. Visualizing cumulative cases demonstrates the overall toll of coronavirus on a country over time.
import plotly.express as px
import numpy as np
df = df_cases
fig = px.choropleth(df, # input dataframe
locationmode='ISO-3', # set of locations used to map 'locations'
locations="ISO-3", # identify country by code
color=np.log10(df['Value']), # identify values and replace linear scale with logarithmic scale
hover_name="Country", # identify column to add as name to hover information
animation_frame="Date", # identify date column
projection="equirectangular", # select projection
hover_data=[df['Value']], # hover text
center = {"lat": 14.883333, "lon": 5.266667},# set map center
color_continuous_scale=px.colors.sequential.Reds, # set color scale, "_r" to reverse color
range_color=[0,round(np.log10(df['Value']).max(),2)], # set the range of dataset
)
#customize layout
fig.update_layout(
title_text='Confirmed cases by country over time<br>January 22, 2020 - January 14, 2021',
geo=dict(showframe=False, showcoastlines=False, projection_type='equirectangular'),
annotations = [dict(x=0.8,y=0.0,xref='paper',yref='paper',
text='Source: <a href="https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv">\
Johns Hopkins University & Medicine</a>',
showarrow = False)],
#customize colorbar
coloraxis_colorbar=dict(
title='Confirmed',
tickvals=[0, 1, 2, 3, 4, 5, 6, 6.7], #customize colorbar title and ticks values
ticktext = ['1', '10','100', '1K', '10K', '100K', '1M', '6M'] #replace log10 colorbar ticks text
)
)
fig.show()
fig.write_html("Confirmed_map.html")
The map has "Play" and "Stop" buttons near the Date of observation mark, and allows zooming and observing the number of cases, ISO-3 codes and dates of observation in hover info for each country.
The map reveals later some interesting details. JHU CSSE COVID-19 Dataset does not contain information on confirmed cases in Somaliland (part of Somalia, ISO-3 code of Somalia is "SOM"), North Korea (ISO-3 code is "PRK") and Turkmenistan ("TKM").
Somaliland has declared independence, but is not recognized internationally (hence not in the ISO list), so choropleth module has to particular code to use to map the data. Plotting by country name is also not possible because Somaliland borders are not interationally set and recognized.
North Korea escalates coronavirus response, but extent of outbreak is unclear; there are no confirmed cases of COVID-19 in North Korea, the government has taken extensive measures, including quarantines and travel restrictions. North Korea didn't admit to its 1st case until July, although city of Kaesong has been focus of quarantines. Since the end of December till August, according to unofficial data North Korea has quarantined and released 25,905 people, 382 of them foreigners.
Lack of information is not surprising in the first and the second case, but Turkmenistan is missing for different reasons. There is no official statistics on COVID-19 spread in Turkmenistan at all. The state-controlled media are not allowed to use the word "coronavirus" and it has even been removed from health information brochures distributed in schools, hospitals and workplaces (according to Turkmenistan Chronicle, one of the few sources of independent news, whose site is blocked within the country). Turkmenistan 2020 population is estimated at 6.0 mln people at mid year according to UN data.
print("Max confirmed cases:", df_cases['Value'].max())
As of August 20, 2020 maximum number of cases - 23.1 mln - were confirmed in USA, and there were performed about 261.5 mln tests there.
I would like to see bigger picture for data, not only by country, but also by region and by income level. To make this happen I add region and income level colunms to all countries. I use the World Bank data to create dataframe-converter and merge additional columns to my dataset.
df_convert=pd.read_csv("iso3_region_income_country.csv")
df_convert.head(3)
df_cases=df_cases.merge(df_convert,on='ISO-3')
df_cases.head(1)
df_cases.isnull().any()
Interactive sunburst plot represents hierarchial data as sectors laid out over several levels of concentric rings. Next sunburst graph shows countries within world's regions where the most cases of virus were confirmed. It is United States and Brazil in Americas, India - in South Asia, Russia - in Europe and Central Asia, and South Africa in African continent.
import numpy as np
import plotly.express as px
df = df_cases[df_cases['Date']=='12/31/20'] # take the last day of observation so cumulative values are maximum
fig = px.sunburst(df, path=['Region', 'Country_WB'], values=df.Value,
color=df.Value, color_continuous_scale='Reds',
title = 'Confirmed cases by regions and countries<br>by December 31, 2020',
color_continuous_midpoint=np.average(df.Value,weights=df.Value))
fig.show()
fig.write_html("Confirmed_region&country.html")
Is there a pattern in terms of virus spread between different regions of income? The next sunburst graph shows that 'High income' countries and 'Upper medium income' countries cover 41% and 39% of total COVID-19 cases respectively; "Lower middle income" countries cover less than 19.5% of total COVID-19 cases, and share of cases confirmed in low income countries is about 0.5%.
df = df_cases[df_cases['Date']=='12/31/20'] # take the last day of observation so cumulative values are maximum
fig = px.sunburst(df, path=['IncomeLevel', 'Country_WB'], values=df['Value'],
color=df['Value'], color_continuous_scale='Reds',
color_continuous_midpoint=np.average(df['Value'], weights=df['Value']),
title = 'Confirmed cases by income level and countries<br>by December 31, 2020')
fig.show()
fig.write_html("Confirmed_income&country.html")
At first it could look like there is a correlation, as 80% of cases are confirmed in countries where income level is higher than medium. But it is importnant to note, that the number of confirmed cases is lower than the number of actual cases at all times, the main reason for that is limited testing. On one hand, this especially could make effect on COVID-19 statistics in lower income countries where the virus is harder to diagnosed due to various limitations. On the other hand, low income countries population is less globaly mobile and this factor is probably slowing down the spreading of virus there in comparison with high income countries.
In any case there are lots of controversial effects from different groups of factors and it is too early to make conclusions at this stage given the available statistics.
import pandas as pd
import pycountry
df_deaths=pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv")
# Aggregate the dataset
df_deaths = df_deaths.drop(columns=['Province/State','Lat','Long'])
df_deaths = df_deaths.groupby('Country/Region').agg('sum')
date_list = list(df_deaths.columns)
df_deaths['Country'] = df_deaths.index
Several countries' names are written differently than pycountry expects, so I change their names to match and get the code
df_deaths.loc[df_deaths.Country=="Burma",'Country']='Myanmar'
df_deaths.loc[df_deaths.Country=="Brunei",'Country']='Brunei Darussalam'
df_deaths.loc[df_deaths.Country=="Iran",'Country']='Iran, Islamic Republic of'
df_deaths.loc[df_deaths.Country=="Congo (Brazzaville)",'Country']='Congo, The Democratic Republic of the'
df_deaths.loc[df_deaths.Country=="Congo (Kinshasa)",'Country']='Republic of the Congo'
df_deaths.loc[df_deaths.Country=="Cote d'Ivoire",'Country']="Côte d'Ivoire"
df_deaths.loc[df_deaths.Country=="Korea, South",'Country']="Korea, Republic of"
df_deaths.loc[df_deaths.Country=="Syria",'Country']="Syrian Arab Republic"
df_deaths.loc[df_deaths.Country=="Taiwan*",'Country']="Taiwan, Province of China"
df_deaths.loc[df_deaths.Country=="Russia",'Country']='Russian Federation'
df_deaths.loc[df_deaths.Country=="West Bank and Gaza",'Country']='Palestine, State of'
df_deaths.loc[df_deaths.Country=="Venezuela",'Country']='Venezuela, Bolivarian Republic of'
df_deaths.loc[df_deaths.Country=="US",'Country']='United States'
df_deaths.loc[df_deaths.Country=="Laos",'Country']="Lao People's Democratic Republic"
As soon as names are unified, I can add their ISO-3 codes.
df_deaths['ISO-3'] = df_deaths['Country'].apply(get_country_code)
# Transform the dataset in a long format
df_deaths = pd.melt(df_deaths, id_vars=['Country','ISO-3'], value_vars=date_list)
df_deaths.head(0)
df_deaths = df_deaths.rename(columns={"variable": "Date","value": "Value"})
df_deaths.head(0)
#add Kosovo ISO-3 code as it is not listed in pycountry dictionary
df_deaths.loc[df_deaths.Country=="Kosovo",'ISO-3']="XKX"
df_deaths = df_deaths.dropna()
import plotly.express as px
df = df_deaths
fig = px.choropleth(df, # input dataframe
locationmode='ISO-3', # set of locations used to map 'locations'
locations="ISO-3", # identify country by code
color=np.log10(df['Value']), # identify values and replace linear scale with logarithmic scale
hover_name="Country", # identify column to add as name to hover information
animation_frame="Date", # identify date column
projection="equirectangular", # select projection
hover_data=[df['Value']], # hover text
center = {"lat": 14.883333, "lon": 5.266667},# set map center
color_continuous_scale=px.colors.sequential.Reds, # set color scale, "_r" to reverse color
range_color=[0,round(np.log10(df['Value']).max(),2)], # set the range of dataset
)
#customize layout
fig.update_layout(
title_text='Deaths from COVID-19 by country over time<br>January 22, 2020 - January 14, 2021',
geo=dict(showframe=False, showcoastlines=False, projection_type='equirectangular'),
annotations = [dict(
x=0.8,
y=0.0,
xref='paper',
yref='paper',
text='Source: <a href="https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv">\
Johns Hopkins University & Medicine</a>',
showarrow = False
)],
#customize colorbar
coloraxis_colorbar=dict(title='Deaths',
tickvals=[0, 1, 2, 3, 4, 5, 5.3010299957], #customize colorbar title and ticks values
ticktext = ['1', '10', '100', '1K', '10K', '100K', '200K'] #replace log10 colorbar ticks text
)
)
fig.show()
fig.write_html("Deaths_map.html")
As of January 2021 maximum number of lethal end cases were registered in USA. Mongolia has reported 2 COVID-19 deaths.
print("Max number of deaths:", df_deaths['Value'].max())
df_deaths = df_deaths.merge(df_convert,on='ISO-3')
df_deaths.head(1)
df_deaths.isnull().any()
Interactive sunburst graph shows countries within world's regions where the lethal end cases of COVID-19 were reported by December 31, 2020.
import numpy as np
import plotly.express as px
# filter the rows for the last day of observation so cumulative values would be maximum and latest
df = df_deaths[df_deaths['Date'] == '12/31/20']
#exclude rows with zero values of 12 countries with no deaths reported
df = df[df['Value'] != 0]
fig = px.sunburst(df, path = ['Region', 'Country'], values = df.Value,
color = df.Value, color_continuous_scale='Reds',
title = 'Deaths from COVID-19 by regions and countries<br>by December 31, 2020',
color_continuous_midpoint=round(np.average(df.Value, weights = df.Value),1))
fig.show()
fig.write_html("Deaths_region&country.html")
Is there a pattern in terms of virus spread between different regions of income? The sunburst graph shows that 'High income' countries and 'Upper medium income' countries cover 42.7% and 36% of total COVID-19 cases respectively; Lower middle and low income countries cover 21.3% of total COVID-19 cases; virus spread in low income countries are not that significant yet, or maybe it just was not diagnosed properly.
# filter the rows for the last day of observation so cumulative values would be maximum and latest
df = df_deaths[df_deaths['Date'] == '12/31/20']
#exclude rows with zero values of 12 countries with no reported deaths
df = df[df['Value'] != 0]
fig = px.sunburst(df, path=['IncomeLevel', 'Country'], values=df.Value,
color = df.Value, color_continuous_scale='Reds',
title = 'Deaths from COVID-19 by income level and country<br>by December 31, 2020',
color_continuous_midpoint=round(np.average(df.Value, weights = df.Value),1))
fig.show()
fig.write_html("Deaths_income&country.html")
import pandas as pd
import pycountry
df_Recovered=pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv")
# Aggregate the dataset
df_Recovered = df_Recovered.drop(columns=['Province/State','Lat','Long'])
df_Recovered = df_Recovered.groupby('Country/Region').agg('sum')
date_list = list(df_Recovered.columns)
df_Recovered['Country'] = df_Recovered.index
Several countries' names are written differently than pycountry expects, so I change their names to match and get the code
df_Recovered.loc[df_Recovered.Country=="Burma",'Country']='Myanmar'
df_Recovered.loc[df_Recovered.Country=="Brunei",'Country']='Brunei Darussalam'
df_Recovered.loc[df_Recovered.Country=="Iran",'Country']='Iran, Islamic Republic of'
df_Recovered.loc[df_Recovered.Country=="Congo (Brazzaville)",'Country']='Congo, The Democratic Republic of the'
df_Recovered.loc[df_Recovered.Country=="Congo (Kinshasa)",'Country']='Republic of the Congo'
df_Recovered.loc[df_Recovered.Country=="Cote d'Ivoire",'Country']="Côte d'Ivoire"
df_Recovered.loc[df_Recovered.Country=="Korea, South",'Country']="Korea, Republic of"
df_Recovered.loc[df_Recovered.Country=="Syria",'Country']="Syrian Arab Republic"
df_Recovered.loc[df_Recovered.Country=="Taiwan*",'Country']="Taiwan, Province of China"
df_Recovered.loc[df_Recovered.Country=="Russia",'Country']='Russian Federation'
df_Recovered.loc[df_Recovered.Country=="West Bank and Gaza",'Country']='Palestine, State of'
df_Recovered.loc[df_Recovered.Country=="Venezuela",'Country']='Venezuela, Bolivarian Republic of'
df_Recovered.loc[df_Recovered.Country=="US",'Country']='United States'
df_Recovered.loc[df_Recovered.Country=="Laos",'Country']="Lao People's Democratic Republic"
As soon as names are unified, I can add their ISO-3 codes.
df_Recovered['ISO-3'] = df_Recovered['Country'].apply(get_country_code)
# View data structure
df_Recovered.head(2)
# Transform the dataset in a long format
df_Recovered = pd.melt(df_Recovered, id_vars=['Country','ISO-3'], value_vars=date_list)
df_Recovered.head(0)
df_Recovered = df_Recovered.rename(columns={"variable": "Date","value": "Value"})
#add Kosovo ISO-3 code as it is not listed in pycountry dictionary
df_Recovered.loc[df_Recovered.Country=="Kosovo",'ISO-3']="XKX"
df_Recovered = df_Recovered.dropna()
# View data shape
df_Recovered.shape
As of January 14, 2021 maximum number of recovered patients in one country were registered in India.
print("Max number of recovered:", df_Recovered.Value.max())
import plotly.express as px
df = df_Recovered
fig = px.choropleth(df, # input dataframe
locationmode='ISO-3', # set of locations used to map 'locations'
locations="ISO-3", # identify country by code
color=np.log10(df['Value']), # identify values and replace linear scale with logarithmic scale
hover_name="Country", # identify column to add as name to hover information
animation_frame="Date", # identify date column
projection="equirectangular", # select projection
hover_data=[df['Value']], # hover text
center = {"lat": 14.883333, "lon": 5.266667},# set map center
color_continuous_scale=px.colors.sequential.Reds, # set color scale, "_r" to reverse color
range_color=[0,round(np.log10(df['Value']).max(),2)], # set the range of dataset
)
#customize layout
fig.update_layout(
title_text='Recovered from COVID-19 by country over time<br>January 22, 2020 - January 14, 2021',
geo=dict(showframe=False, showcoastlines=False, projection_type='equirectangular'),
annotations = [dict(
x=0.8,
y=0.0,
xref='paper',
yref='paper',
text='Source: <a href="https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv">\
Johns Hopkins University & Medicine</a>',
showarrow = False
)],
#customize colorbar
coloraxis_colorbar=dict(title='Recovered',
tickvals=[0,1,2, 3, 4, 5, 6, 7], #customize colorbar title and ticks values
ticktext = ['1','10','100', '1K', '10K', '100K', '1M', '10M'] #replace log10 colorbar ticks text
)
)
fig.show()
fig.write_html("Recovered_map.html")
df_Recovered = df_Recovered.merge(df_convert,on='ISO-3')
df_Recovered.head(1)
df_Recovered.isnull().any()
Interactive sunburst graph shows countries within world's regions where the recovery cases of COVID-19 were reported by August 21, 2020.
import numpy as np
import plotly.express as px
# filter the rows for the last day of observation so cumulative values would be maximum and latest
df = df_Recovered[df_Recovered['Date'] == '12/10/20']
#exclude rows with zero values of 12 countries with no deaths reported
df = df[df['Value'] != 0]
fig = px.sunburst(df, path = ['Region', 'Country'], values = df.Value,
color = df.Value, color_continuous_scale='Reds',
title = 'Recovered from COVID-19 by regions and countries<br>by December 10, 2020',
color_continuous_midpoint=round(np.average(df.Value, weights = df.Value),1))
fig.show()
fig.write_html("Recovered_region&country.html")
# filter the rows for the last day of observation so cumulative values would be maximum and latest
df = df_Recovered[df_Recovered['Date'] == '12/10/20']
#exclude rows with zero values of 12 countries with no reported deaths
df = df[df['Value'] != 0]
fig = px.sunburst(df, path=['IncomeLevel', 'Country'], values=df.Value,
color = df.Value, color_continuous_scale='Reds',
title = 'Recovered from COVID-19 by income level and country<br>by December 10, 2020',
color_continuous_midpoint=round(np.average(df.Value, weights = df.Value),1))
fig.show()
fig.write_html("Recovered_income&country.html")
In order to make meaningful comparison of deaths from virus between the countries I need to import population data first, and calculate deaths per million ratio.
import pandas as pd #to work with tabular data
import requests #to access the csv from the url string
import io #to read the csv directly from the url string
url_pop = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv"
raw=requests.get(url_pop).content
df_pop=pd.read_csv(io.StringIO(raw.decode('utf-8')))
# View data shape
df_pop.shape
I expected to see about 200 rows in the dataset (1 row per country), but its shape contains 4153 rows. The reason for that is, as we can see below, that it contains not only population of the countries, but mostly population of provinces, states and cities for some countries.
# View a slice of loaded data
df_pop[78:100]
To avoid double counting I need to clean the dataset from excessive information. To do that, I filter only the rows with values NaN in column "Province_State" using the isnull function. After that I can simply drop "UID", "iso2", "code3", "FIPS", "Admin2", "Province_State", "Combined_Key" columns.
df_pop = df_pop[pd.isnull(df_pop.Province_State)]
df_pop = df_pop.drop("UID", axis=1)
df_pop = df_pop.drop("iso2", axis=1)
df_pop = df_pop.drop("FIPS", axis=1)
df_pop = df_pop.drop("Admin2", axis=1)
df_pop = df_pop.drop("Province_State", axis=1)
df_pop = df_pop.drop("Combined_Key", axis=1)
df_pop = df_pop.drop("code3", axis=1)
Now the dataset contains only population of 191 countries, and 5 columns instead of 12.
df_pop.shape
There were 2 empty cells in population column of the dataframe, they contained data on "MS Zaandam" and "Diamond Princess". Delete them
df_pop = df_pop[df_pop.Country_Region !='MS Zaandam']
df_pop = df_pop[df_pop.Country_Region !='Diamond Princess']
df_pop.shape
df_pop = df_pop.reset_index(drop=True) #reset indexes for reduced number of rows
If I check maximum population, I see population of China.
df_pop.Population.max()
Minimum population is in Vatican ("Holy See").
df_pop.Population.min()
Vatican population is too small to be visible on the map or graph, so it is better to get rid of it in dataset.
df_pop = df_pop[df_pop.Country_Region !='Holy See']
df_pop.shape
Now the dataset has the same number of countries, as the datasets of COVID-19 statistics. In order to use population data for calculation, I unify the "iso3"/"ISO-3" names of columns and merge "Population" column to df_deaths dataframe.
df_pop = df_pop.rename(columns={"iso3": "ISO-3","Country_Region": "Country"})
df_deaths.head(1)
Both dataframes based on the same list of 185 countries.
sorted(df_pop['ISO-3'].unique().tolist()) == sorted(df_pop['ISO-3'].unique().tolist())
df_pop = df_pop.drop(axis=1, columns = ['Country', 'Lat', 'Long_'])
df_deaths_pop = df_deaths.merge(df_pop,on='ISO-3')
df_deaths_pop["Deaths per mln"] = round((df_deaths_pop.Value / df_deaths_pop.Population * 1000000),2)
df_deaths_pop.tail()
print("Deaths per mln, min = ",df_deaths_pop["Deaths per mln"].min())
print("Deaths per mln, max = ",df_deaths_pop["Deaths per mln"].max())
print("Deaths per mln, avg = ",round(df_deaths_pop["Deaths per mln"].mean(),2))
import plotly.express as px
df = df_deaths_pop
fig = px.choropleth(df, # input dataframe
locationmode='ISO-3', # set of locations used to map 'locations'
locations="ISO-3", # identify country by code
color=df['Deaths per mln'], # identify values
hover_name="Country", # identify column to add as name to hover information
animation_frame="Date", # identify date column
projection="equirectangular", # select projection
hover_data=[df['Deaths per mln']], # hover text
center = {"lat": 14.883333, "lon": 5.266667},# set map center
color_continuous_scale=px.colors.sequential.Reds, # set color scale, "_r" to reverse color
range_color=[0,round(df["Deaths per mln"].max(),0)] # set the range of dataset
)
#customize layout
fig.update_layout(
title_text='Deaths from COVID-19 per million people by country over time<br>January 22, 2020 - January 14, 2021',
geo=dict(showframe=False, showcoastlines=False, projection_type='equirectangular'),
annotations = [dict(x=0.8,y=0.0,xref='paper',yref='paper', text='Source: <a href="https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data">\
Johns Hopkins University & Medicine</a>', showarrow = False)],
#customize colorbar
coloraxis_colorbar=dict(title='Deaths<br>per M')
)
fig.show()
fig.write_html("Deaths_per_mln_map.html")
This part of the project is based on ideas "Variation in governmentresponses to COVID-19" paper by Oxford:
"As governments continue to respond to COVID-19, it is imperative to study what measures are effective and which are not. While the data presented here do, of course, not measure effectiveness directly, they can be useful input to studies that analyse factors affecting disease progression. OxCGRT seeks to contribute to this knowledge gap by providing comparable measures of individual policy actions, as well as several comparable aggregate indices. We find significant variation in both the measures that governments adopt and when they adopt them. Going forward, governments will benefit from adopting an evidence-based approach to the measures they deploy."
Paper: https://www.bsg.ox.ac.uk/sites/default/files/2020-05/BSG-WP-2020-032-v6.0.pdf
The Oxford Covid-19 Government Response Tracker (OxCGRT) collects systematic information on which governments have taken which measures, and when. The data is daily published at project's GitHub page https://github.com/OxCGRT/covid-policy-tracker
import requests #to access the csv from the url string
import io #to read the csv directly from the url string
import pandas as pd #to work with tabular data
import pycountry #to get the three-letter country codes ISO 3166–1 for each country
url_OxCGRT = "https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv"
raw=requests.get(url_OxCGRT).content
df_OxCGRT=pd.read_csv(io.StringIO(raw.decode('utf-8')))
df_OxCGRT.shape
print("Minimum date in YYYYMMDD format is", df_OxCGRT.Date.min())
print("Maximum date in YYYYMMDD format is", df_OxCGRT.Date.max())
Considering that OxCGRT data becomes available with the gap of 14-30 days, step back for 1 months and choose the date when all the indexes are already calculated, for example, December 1st, 2020.
df_OxCGRT[df_OxCGRT.Date == 20201201].head(3)
# drop excessive columns
df_OxCGRT_indexes = df_OxCGRT.drop(axis=1, columns = ['C1_School closing', 'C1_Flag',
'C2_Workplace closing', 'C2_Flag',
'C3_Cancel public events', 'C3_Flag',
'C4_Restrictions on gatherings', 'C4_Flag',
'C5_Close public transport', 'C5_Flag',
'C6_Stay at home requirements', 'C6_Flag',
'C7_Restrictions on internal movement', 'C7_Flag',
'C8_International travel controls',
'E1_Income support', 'E1_Flag',
'E2_Debt/contract relief', 'E3_Fiscal measures',
'E4_International support',
'H1_Public information campaigns', 'H1_Flag',
'H2_Testing policy', 'H3_Contact tracing',
'H4_Emergency investment in healthcare',
'H5_Investment in vaccines',
'M1_Wildcard',
'StringencyIndex',
'StringencyLegacyIndex',
'StringencyLegacyIndexForDisplay',
'GovernmentResponseIndex',
'ContainmentHealthIndex',
'EconomicSupportIndex'
])
df_OxCGRT_indexes[df_OxCGRT_indexes.Date == 20201201]
The Oxford Covid-19 Government Response Tracker tracks individual policy measures across 17 indicators and calculate several indices to give an overall impression of government activity.
I am particularly interested in 2 aggregated indexes calculated by Oxford:
Each of these indices report a number between 0 to 100 that reflects the level of the governments response along certain dimensions. This is a measure of how many of the relevant indicators a government has acted upon, and to what degree. The index cannot say whether a government's policy has been implemented effectively.
Each index dataframe could be downloaded as separate .csv file from https://github.com/OxCGRT/covid-policy-tracker/tree/master/data/timeseries
df_iGovResp=pd.read_csv("index_governmentresponse.csv")
print("Number of countries with indexes:",len(df_iGovResp.dropna().CountryCode.tolist()))
I will plot the data of United States, Canada, China, Brazil and Russia, and compare those countries' indexes to each other and world's average (calculated as mean of 183 countries available in this dataset). First, identify parts of dataframe I need to plot:
ca_iGovResp = df_iGovResp[df_iGovResp.CountryName == 'Canada'].drop(axis=1, columns = ['CountryCode']) #Canada
ru_iGovResp = df_iGovResp[df_iGovResp.CountryName == 'Russia'].drop(axis=1, columns = ['CountryCode']) #Russia
wld_iGovResp = df_iGovResp.mean(axis=0, skipna=True).drop(axis=1, columns = ['CountryCode']) #"World", 183 countries
cn_iGovResp = df_iGovResp[df_iGovResp.CountryName == 'China'].drop(axis=1, columns = ['CountryCode']) #China
br_iGovResp = df_iGovResp[df_iGovResp.CountryName == 'Brazil'].drop(axis=1, columns = ['CountryCode']) #Brazil
us_iGovResp = df_iGovResp[df_iGovResp.CountryName == 'United States'].drop(axis=1, columns = ['CountryCode']) #United States
import matplotlib.pyplot as plt
import matplotlib.style as style
import numpy as np
import seaborn as sns
%matplotlib inline
SMALL_SIZE = 12
MEDIUM_SIZE = 14
BIGGER_SIZE = 16
plt.rc('font', size=MEDIUM_SIZE) # controls default text sizes
plt.rc('axes', titlesize=BIGGER_SIZE) # fontsize of the axes title
plt.rc('axes', labelsize=MEDIUM_SIZE) # fontsize of the x and y labels
plt.rc('xtick', labelsize=MEDIUM_SIZE) # fontsize of the tick labels
plt.rc('ytick', labelsize=MEDIUM_SIZE) # fontsize of the tick labels
plt.rc('legend', fontsize=MEDIUM_SIZE) # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE) # fontsize of the figure title
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
plt.figure(figsize=(15,8))
plt.style.use('seaborn-whitegrid')
plt.title('Government response index \n January 1, 2020 - January 14, 2021')
plt.axhline(y = 0, color = 'black', linewidth = 1.3, alpha = .7)
plt.tick_params(axis = 'both', which = 'major')
x = np.linspace(0, 380, 380) # x axis start point, end point and number of intervals
xlim = (0, 380) # y axis start point, end point
y_pos = np.arange(len(wld_iGovResp))
# use the plt.xticks function to custom labels
plt.xticks(y_pos, color='black', rotation=False)
plt.xticks(np.arange(0, 380, 20.0))
plt.yticks(np.arange(10, 110, 10))
plt.xlabel('Days since January 1, 2020')
plt.ylabel('Government response index (max = 100)')
y1 = np.array(ca_iGovResp.drop(axis=1, columns = 'CountryName').values.tolist()[0])
y2 = np.array(us_iGovResp.drop(axis=1, columns = 'CountryName').values.tolist()[0])
y3 = np.array(cn_iGovResp.drop(axis=1, columns = 'CountryName').values.tolist()[0])
y4 = np.array(br_iGovResp.drop(axis=1, columns = 'CountryName').values.tolist()[0])
y5 = np.array(wld_iGovResp)
y6 = np.array(ru_iGovResp.drop(axis=1, columns = 'CountryName').values.tolist()[0])
ca = plt.plot(x, y1, label='Canada',c="xkcd:leaf green", lw=3, animated =True)
us = plt.plot(x, y2, label='USA', c="royalblue", lw=3, animated =True)
cn = plt.plot(x, y3, label='China', c="mediumturquoise", lw=3, animated =True)
br = plt.plot(x, y4, label='Brazil', c="xkcd:pink", lw=3, animated =True)
wld = plt.plot(x, y5, label='World', ls='--',c='xkcd:dark grey', lw=3, animated =True)
ru = plt.plot(x, y6, label='Russia', c="xkcd:tomato", lw=3, animated =True)
plt.axis([0, 380, 0, 100])
plt.axhline(0, c='black', ls='-', lw=2)
plt.axvline(0, c='black', ls='-', lw=2)
plt.legend(loc='upper left', frameon=True, fancybox=True, framealpha=0.9, facecolor='whitesmoke', ncol=3)
plt.show()
The graph shows government response index to COVID-19 for period since January 1st till August 16th.
Government of China was the first one that responded to COVID-19, dealing with the outbreak first identified in Wuhan in December 2019. It took actions 50-70 days ahead of the rest of the world. Measures taken were more strict in terms of methods, to prevent more damage from virus. Both developing countries (like Brazil and Russia) and developed countries (United States and Canada) prefered not to make unpopular and costly decisions and not enforce strict measures to prevent virus spread till mid-March.
Brazil government reaction was relatively mild in comparison with both all of the rest plotted countries and world's average. Russian government took more strict measures and keeps that level above world's average since March.
data_all = df_deaths_pop #start aggregating data to single dataframe
data_all = data_all[data_all.Date == '12/31/20'] #filter the latest day to research
print(data_all.shape)
data_all = data_all.rename(columns = {'Value':'Deaths',
'IncomeLevel':'Income',
'Death_per_mln':'Deaths per mln'})
print(data_all.shape)
#merge Cases to data_all
Cases = df_cases[df_cases['Date']=='12/31/20']
Cases = Cases[Cases['Value']>0]
Cases = Cases.drop(axis=1, columns = ['Country','Date','Region','IncomeLevel','Country_WB'])
Cases = Cases.rename(columns = {'Value':'Cases'})
data_all = data_all.merge(Cases, on = "ISO-3")
#Calculate Cases per mln of population
data_all['Cases per mln'] = data_all.Cases / data_all.Population *1000000
print(data_all.shape)
data_all.head(1)
I explore the correlation between columns.
print("Q: Do cumulative cases correlate with cumulative deaths over time?")
data_all_nonulls = data_all[data_all.Deaths != 0] #filter out 12 countries that have not reported deaths
r = data_all_nonulls.Deaths.corr(data_all_nonulls.Cases)
r = round(r,2)
print("A: Yes, correlation is positive and strong, standard correlation coefficient is",r)
# Cases vs Deaths linear plot
import plotly.express as px
df = data_all_nonulls
fig = px.scatter(df, x=df['Cases'], y=df['Deaths'],
color='Region', size=round(df['Cases per mln'],1),
hover_data=['Country'],
labels={'x':'Cases', 'y':'Deaths', 'size':'Cases per mln'},
)
fig.update_layout(plot_bgcolor = 'whitesmoke')
fig.show()
print("Q: log(Deaths) correlate with log(Cases)?")
r = (np.log(data_all_nonulls.Deaths).corr(np.log(data_all_nonulls.Cases), method = "pearson"))
r = round(r,2)
print("A: Yes, correlation is positive and very strong, r =", r)
data_all_nonulls
The next scatter plot shows 3 indicators: number of cases (axis x), number of daeths (axis y) and number of deaths per million population (size of the bubble).
# log(Cases) correlation with log(Deaths), logarithmic plot
# Cases vs Deaths
import plotly.express as px
df = data_all_nonulls
fig = px.scatter(df, x=df['Cases'], y=df['Deaths'],
color='Region', size=round(df['Cases per mln'],1),
hover_data=['Country'],
labels={'x':'Cases', 'y':'Deaths', 'size':'Cases per mln'}
)
fig.layout
fig.update_layout(xaxis_type="log", yaxis_type="log",
height=600, width=1000,
title_text="Correlation between cases and deaths",
showlegend=True,
legend=dict(yanchor="top",
y=0.96,
xanchor="left",x=0.02),
plot_bgcolor = 'whitesmoke'
)
fig.show()
print("Q: Deaths correlate with Deaths per mln?")
r = data_all_nonulls.Deaths.corr(data_all_nonulls['Deaths per mln'], method = "pearson")
r = round(r,2)
print("A: Positive correlation is insignificant, r =", r)
print("Q: log(Deaths) correlate with log(Deaths per mln)?")
r = np.log(data_all_nonulls.Deaths).corr(np.log(data_all_nonulls['Deaths per mln']),method = "pearson")
r = round(r,2)
print("A: Yes, positive correlation is quite strong, r =", r)
#Creating dataframe on maximum government response index
resp = pd.DataFrame(df_iGovResp.max(axis=1))
ind = pd.DataFrame(df_iGovResp.CountryCode)
Response = (resp.merge(ind,
left_index=True,
right_index=True)).rename(columns={"CountryCode": "ISO-3",
0: "Response"})
print(Response.shape)
# Merging maximum government response index into data_all, ignoring 0 deaths
data_all_nonulls = data_all[data_all.Deaths != 0] #filter out countries if not reported deaths
data_all_nonulls = data_all_nonulls.merge(Response, on = "ISO-3")
print(data_all_nonulls.shape)
print("Q: Do total cumulative deaths correlate with maximum government response?")
r = (np.log(data_all_nonulls.Deaths).corr(data_all_nonulls.Response, method = "pearson"))
r = round(r,2)
print("A: Positive correlation is insignificant, r =", r)
Total deaths are not correlated to maximum government response, as many countries started to take measures before epidemic led to deaths number growth in order to prevent the losses, and many of them keep high level of restrictions after the deaths number starts to decrease.
print("Q: Do maximum deaths per mln correlate with maximum government response index?")
r = round(data_all_nonulls['Deaths per mln'].corr(data_all_nonulls.Response),2)
print("A: No, r =", r)
print("Q: Does total cumulative number of cases correlate with maximum government response?")
r = (np.log(data_all_nonulls['Cases']).corr(data_all_nonulls.Response, method = "pearson"))
r = round(r,2)
print("A: No, r =", r)
# Merging Response and Government health expenditure per capita 2017, international $
import world_bank_data as wb
df_govHealth = pd.DataFrame(wb.get_series('SH.XPD.GHED.PP.CD',date='2017',id_or_value="id",simplify_index=True).dropna())
df_govHealth['ISO-3'] = df_govHealth.index
Response_Funding = Response.merge(df_govHealth, on = 'ISO-3')
print("Q: Does government response index correlate with government health expenditure?")
r = round(Response_Funding['Response'].corr(Response_Funding['SH.XPD.GHED.PP.CD']),2)
print("A: No, there is no correlation, r =", r)
# log(Cases) correlation with Response over time
#RESPONSE Jan 22-Jan 14
y1[21:] #ca_iGovResp Jan 22-Aug 16
y2[21:] #us_iGovResp Jan 22-Aug 16
y3[21:] #cn_iGovResp Jan 22-Aug 16
y4[21:] #br_iGovResp Jan 22-Aug 16
#y5[21:] #wld_iGovResp Jan 22-Aug 16
y6[21:] #ru_iGovResp Jan 22-Aug 16
#print(len(y1[21:]))
#CASES Jan 22-Jan 14
cases_part = df_cases
#print(len(cases_part.Date.unique()))
x1 = np.array(cases_part.Value[cases_part['ISO-3']=='CAN'])
x2 = np.array(cases_part.Value[cases_part['ISO-3']=='USA'])
x3 = np.array(cases_part.Value[cases_part['ISO-3']=='CHN'])
x4 = np.array(cases_part.Value[cases_part['ISO-3']=='BRA'])
#x5 = np.array(cases_part.groupby('Date')['Value'].mean()) #world's average, !sorted
x6 = np.array(cases_part.Value[cases_part['ISO-3']=='RUS'])
Canada = pd.DataFrame(x1,y1[21:])
Canada['Response'] = Canada.index
Canada['Country'] = "Canada"
Canada = Canada.rename(columns={0: "Cases"}).reset_index(drop=True)
USA = pd.DataFrame(x2,y2[21:])
USA['Response'] = USA.index
USA['Country'] = "USA"
USA = USA.rename(columns={0: "Cases"}).reset_index(drop=True)
China = pd.DataFrame(x3,y3[21:])
China['Response'] = China.index
China['Country'] = "China"
China = China.rename(columns={0: "Cases"}).reset_index(drop=True)
Brazil = pd.DataFrame(x4,y4[21:])
Brazil['Response'] = Brazil.index
Brazil['Country'] = "Brazil"
Brazil = Brazil.rename(columns={0: "Cases"}).reset_index(drop=True)
Russia = pd.DataFrame(x6,y6[21:])
Russia['Response'] = Russia.index
Russia['Country'] = 'Russia'
Russia = Russia.rename(columns={0: "Cases"}).reset_index(drop=True)
To find HEX codes for this colors (Canada 'leaf green', USA 'royalblue', China 'mediumturquoise", Brazil 'pink', Russia 'tomato'), view the list of colors from mathplotlib:
import matplotlib
colorname = []
colorid = []
for name, hex in matplotlib.colors.cnames.items():
colorname.append(name)
colorid.append(hex)
zippedcolors = list(zip(colorname, colorid))
zippedcolors = sorted(zippedcolors, key=lambda x: x[1])
#zippedcolors #uncomment to view the list of colors with id
# log(Cases) correlation with (Government response index) over time
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
fig = make_subplots(rows=1, cols=1)
fig.append_trace(go.Scatter(x = Canada['Cases'],
y = Canada['Response'], name="Canada",
mode="lines", line=dict(color='#5ca904')),
row=1, col=1)
fig.append_trace(go.Scatter(x = USA['Cases'],
y = USA['Response'], name="USA",
mode="lines", line=dict(color='#4169E1')),
row=1, col=1)
fig.append_trace(go.Scatter(x = China['Cases'],
y = China['Response'], name="China",
mode="lines", line=dict(color='#48D1CC')),
row=1, col=1)
fig.append_trace(go.Scatter(x = Brazil['Cases'],
y = Brazil['Response'], name="Brazil",
mode="lines", line=dict(color='#FFC0CB')),
row=1, col=1)
fig.append_trace(go.Scatter(x = Russia['Cases'],
y = Russia['Response'], name="Russia",
mode="lines", line=dict(color='#FF6347')),
row=1, col=1)
fig.layout
fig.update_layout(xaxis_type="log", height=600, width=1000,
title_text="Cases vs government response",
showlegend=True,
legend=dict(yanchor="top",
y=0.99,
xanchor="left",x=0.01),
legend_title_text='Country',
plot_bgcolor = 'whitesmoke'
)
fig.show()
I have confirmed all 3 hypothesis:
I found out that positive correlation between cases of COVID-19 and deaths from it is very strong, as well as between confirmed cases and government response. Government response index and government healthcare expenditure in previous years are not correlated. These findings are relevant for different regions, but their relevancy for the lower middle income countries and low income countries is not fully confirmed yet due to limitations of the dataset. The code I wrote could be used for daily motinoring of the situation and government response to it.
It is too early to make final conclusions about correlations of COVID mortality for different regions or income levels as disease statistics database is only forming and growing, and information available now is incomplete.
The indicators that we could use to describe the situation unfolding now are work in progress.